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Abstract 

Model selection consistency in the high-dimensional regression setting 
can be achieved only if strong assumptions are fulfilled. We therefore suggest 
to pursue a different goal, which we call a minimal class of models. The 
minimal class of models includes models that are similar in their prediction 
accuracy but not necessarily in their elements. We suggest a random search 
algorithm to reveal candidate models. The algorithm implements simulated 
annealing while using a score for each predictor that we suggest to derive 
using a combination of the Lasso and the Elastic Net. The utility of using 
a minimal class of models is demonstrated in the analysis of two datasets. 

Keywords. Model selection; High-dimensional data; Lasso; Elastic-Net; Simu¬ 
lated annealing 

1 Introduction 

High dimensional statistical problems have been arising as a result of the vast 
amount of data gathered today. A more specihc problem is that estimation of the 
usual linear regression coefficients vector cannot be performed when the number of 
predictors exceeds the number of observations. Therefore, a sparsity assumption 
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is often added. For example, the number of regression coefficients that are not 
equal to zero is assumed to be small. If it was known in advance which predictors 
have non zero coefficients, the classical linear regression estimator could have been 
used. Unfortunately, it is not known. Even worse, the natural relevant discrete 
optimization problem is usually not computationally feasible. 


The Lasso estimator, Tibshirani (1996), which solves the problem of mini¬ 
mizing prediction error together with a £i—norm penalty, is possibly the most 
popular method to address this problem, since it results in a sparse estimator. 
Various algorithms are available to compute this estimator [e.g.. 


Friedman et ah 


(2010)|. The theoretical properties of the Lasso have been throughly researched 


in the last decade. For the high-dimension problem, prediction rates were estab¬ 
lished in various manners, Greenshtein fc Ritov (200^ Bunea et ah (2006) Bickel 


et al. (2009) Bunea et ah (2007) Meinshausen & Yu (2009) The capability of 


the Lasso to choose the correct model depends on the true coefficient vector and 
the matrix of the predictors, or more precisely, on its Gram matrix, [Meinshausen 


& Biihlmann (2006) Zhao fc Yu (2006')| Zhang & Huang (2008) However, the 


underlying assumptions are typically rather restrictive, and cannot be checked in 
practice. 

In order to overcome its initial disadvantages, many modifications of the Lasso 
were suggested. For example, the Adaptive Lasso, Zou (2006), is a two stage 
procedure with a second step weighted Lasso, that is, some predictors get less 
penalty than others; When a grouped structure of the predictors is assumed, the 


Group Lasso, Yuan & Lin (2006), is often used; The Elastic Net estimator, Zou| 


& Hastie (2005), is intended to deal with correlated predictors. It is obtained by 


adding a penalty on the f' 2 ~iiorm of the coefficients vector together with the 


Lasso penalty. Zou & Hastie (2005) also empirically found that the Elastic Net’s 
prediction accuracy is better than the Lasso’s. 

In the high-dimensional setting, the task of finding the true model might be 
too ambitious, if meaningful at all. Only in certain situations, which could not 
be identified in practice, model selection consistency is guaranteed. Even in the 
classical setup, with more observations than predictors, there is no model selection 
consistent estimator unless further assumptions are fulhlled. This leads us to 
present a different objective. Instead of searching for a single “true” model, we 
aim to present a number of possible models a researcher should look at. Our 
goal, therefore, is to hnd potential good prediction models. Since data are not 
generated by computer following one’s model, there is a benefit in finding several 
models with similar performance if they exist. In short, we suggest to find the 
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best models for each small model size. Then, by looking at these models one may 
reach interesting conclusions regrading the underlying problem. Some of these, 
as we do below, can be concluded using statistical reasoning, but most of these 
should be reasoned by a subject matter expert. 

In order to find these models, we implement a search algorithm that uses 


simulated annealing, Kirkpatrick et ah (1983), The algorithm is provided with 
a “score” for each predictor that we suggest to get using a multi-step procedure 
that implements both the Lasso and the Elastic Net (and then the Lasso again). 
Multi-step procedures in the high-dimensional setting have drawn some attention 


and were demonstrated to be better than using solely the Lasso, Zou (2006) Bickel 


et al. ( 2010 ) 


The rest of the paper is organized as follows. Section [^presents the concept of 
minimal class of models and the notations. Section describes a search algorithm 
for relevant models, and gives motivation for the sequential use of the Lasso and 
the Elastic Net when calculation a score to each predictor. Section consists 
of a simulation study and two examples of data analysis using a minimal class of 
models. Section [^suggests a short discussion. Technical proofs and supplementary 
data are provided in the appendix. 


2 Description of the problem 

We start with notations. First, denote ||n||g := g > 0 for the ^q 

(pseudo) norm of any vector v, ||n||o = limq_^o the cardinality of v. The data 
consist of a predictors matrix, Xnxp = (W^) ... and a response vector, 

Ynxi- WLOG, X is centered and scaled and Y is centered as well. We are mainly 
interested in the case p > n. The underling model is K = Xf3 + e where e„xi is a 
random error, E(e) = 0, V(e) = a^I, I is the identity matrix, fd is an unknown 
parameter and its true value is denoted by (3^. 

Denote S C { 1 , ...,p} for a set of indices of X. We call S a model. We use 
s = [S'! to denote the cardinality of the set S. Denote also Sq := {j : /9° 7 ^ 0} and 
■So = |*S'o| for the true model, and its size, respectively. For any model S', we define 
Xs to be the submatrix of X which includes only the columns specified by S. Let 
to be the usual least square (LS) estimator corresponding to a model S', that 
is, 

= (XjXs)-^XjK, 

provided X^Xs is non singular. 
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Now, the straightforward approach to estimate Sq given a model size k is to 
consider the following optimization problem: 


min-||F-X/3||2, 
0 n 


s.t \\P\\q = k. 


( 1 ) 


Unfortunately, typically, solving Q is computationally infeasible. Therefore, other 
methods were developed and are commonly used. These methods produce sparse 
estimators and can be implemented relatively fast. We hrst present here the Lasso, 


Tibshirani (1996), dehned as 


/3^ = argmin -||y-X/3||2 + A||/3||i 
0 vn 


( 2 ) 


where A > 0 is a tuning constant. For some applications, a different amount of 
regularization is applied for each predictor. This is done using the weighted Lasso, 
dehned by 

[3^ = argminr-llU - X/3\\l + AHw • /3||i) (3) 

j 3 \n / 

where tc is a vector of p weights, Wj > 0 for all j, and a ■ 6 is the Hadamard (Schur, 


entrywise) product of two vectors a and b. The Adaptive Lasso, Zou (2006) 


IS 


one example of using a weighted Lasso type estimator. Next is the Elastic Net 
estimator 

= argmin[-||y - Xl3\\l + Ai||/3||i + AapHs)- (4) 

This estimator is often described as a compromise between the Lasso and the well 


known Ridge regression, Hoerl & Kennard (1970), since it could be rewritten as 


= argmin(^-||y - X(3\\l + X{a\\/3\\i + (1 - a)||/3||^) j. (5) 

Let Pn be a sequence of estimators for /3 and let Sn be the sequence of corre- 


J_jC L UC CXi v/i “o LlliicxLL/i o iU/i ClliLJ. 1“ u UC 

sponding models. Model selection consistency is commonly 


dehned 


as 


lim P{Sn = Sq) = 1. 


( 6 ) 


If p n and small, then criteria based methods e.g., BIG, Schwarz (1978)1 can 
be model selection consistent if p is hxed or if suitable conditions are fulhlled. 


c.L, Wang et ah (2009) and references therein. However, these methods are rarely 
computationally feasible for large p. For p > n, it turns out that practically 
strong and unverihable conditions are needed to achieve ([^ for popular regular- 
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ization based estimators such as the Lasso: Zhao fc Yu (2006)] and Meinshausen & 


Biihlmann (2006) the Adaptive Lasso: Huang et ah (2008) the Elastic Net: Jia & 


Yu (2010)1 but also for Orthogonal Matching Pursuit (OMP), which is essentially 


forward selection: Tropp (2004) || and Zhang (2009) 


In light of these established results, we suggest to pursue a different goal. 
Instead of Ending a single model, we suggest to look for a group of models. Each 
of these models should include low number of predictors, but it should also be 
capable of predicting Y well enough. Finally, G = G{k, rj) is called a minimal class 
of models of size k and efficiency rj if 


^ ; 1^1 =K^-\\Y 

n 


XsK^Wi < min 1-11Y 
* ^ |5'|=k n 


+ ( 7 ) 


One could control how similar the models in Q are to each other in terms of 
prediction, using the tuning parameter rj. A reasonable choice is t] = ca^ with 
some c > 0. If is unknown, it could be replaced with an estimate, e.g., using 
the Scaled Lasso, Sun & Zhang (2012) An alternative to G is to generate the 
set of models by simply choosing for each k the M models having the smallest 
sample MSE, for some number M. The LS estimator, minimizes the sample 
prediction error for any model S with size s < n Thus, this estimator is used 
for each of the considered models. 

Note that G depends on k, the desired model size. However, in practice one 
may want to find G for a few values of k, e.g., k = 1,..., 10, and then to examine 
the pooled results, Another option is to replace the Mean Square 

Error (MSE) n“^||Y — Xs/3g^\\2 in the definition of G with one of the available 


model selection criteria, e.g., AIC , Akaike (1974), BIC or Lasso. Note that we are 
interested in situations where there are fair models with a relatively very small 
number of explanatory variables out of the p available. 

At this point, a natural question is how can we benefit from using a minimal 
class of models. Examining the models in G may allow us to derive conclusions 
regarding the importance of different explanatory variables. If, for example, a 
variable appears in all the models that belong to we may infer that it is essential 
for prediction of Y, and cannot be replaced. We demonstrate this kind of analysis 
in Section 14.21 

Another possibility is to use one out of the many aggregation of models meth¬ 
ods estimates. Aggregation of estimates obtained by different models was sug¬ 


gested both for the frequentist, Hjort & Claeskens (2003), and for the Bayesian, 


Hoeting et ah (1999) The well known “Bagging”, Breiman (1996), is also a 
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technique to combine results from various models. Averaging across estimates ob¬ 
tained by multiple models is usually carried out to account for the uncertainty in 
the model selection process. We, however, are not interested in improving predic¬ 
tion per se, but in identifying good models. Nor are we interested in identifying 
the best model, since this is not possible in our setup, but in identifying variables 
that are potentially relevant and important. 


2.1 Relation to other work 


A similar point of view on the relevance of a variable was given by Bickel fc Cai 


(2012) They considered a variable to be important if its relative contribution to 


the predictive power of a set of variables is high enough. Their next step was to 
consider only specific type of sets, such that their prediction error is high, yet they 
do not contain too many variables. 


Rigollet & Tsybakov (2012) investigated the relevant question of prediction 


under minimal conditions. They showed that linear aggregation of estimators is 
benehcial for high-dimensional regression when assuming sparsity of the number of 
estimators included in the aggregation. They also showed that choosing exponen¬ 
tial weights for the aggregation corresponds to minimizing a specific, yet relevant, 
penalized problem. Their estimator, however, is computationally impossible and 
they have little interest in variables and model identification. 

As described in Section]^ our suggested search algorithm for candidate models 
travels through the model space. We choose to use simulated annealing to prevent 
the algorithm from getting stuck in a local minimum. Various Bayesian model 
selection procedures consists moves along the model space, usually using a relevant 


posterior distribution, cf. O’Hara & Sillanpaa (2009), We, however, do not assume 
any prior distribution for the coefficient values. Our use of the algorithm is only 
as a search mechanism, simply to hnd as many as possible models that belong 
to Q. Convergence properties of the classical simulated annealing algorithm are 
not of interest to our use of it. We are interested in the path generated by the 
algorithm and not in its hnal state. 
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3 A search algorithm 


3.1 Simulated annealing algorithm 

In this section, we suggest an algorithm to find Q for a given k and rj. The problem 
is that I \ Y is unknown for all S, and since p is large, even for a relatively 
small K, the number of possible models is huge (e.g., for p = 200, k = 4 there are 
almost 65 million possible models). We therefore suggest to focus our attention 
on smaller set of models, denoted by At is a large set of models, but not 

too large so we can calculate MSEs for all the models within At in a reasonable 
computer running time. Once we have At and the corresponding MSEs, we can 
form Q by choosing the relevant models out of At. 

The remaining question is how to assemble At for a given k. Any greedy 
algorithm is bound to find models that are all very similar. Our purpose is to 
find models that are similar in their predictive power, but heterogeneous in their 
structure. 

Our approach therefore is to implement a search algorithm which travels be¬ 
tween potentially attractive models. We use a simulated annealing algorithm. 
The simulated annealing algorithm was suggested for function optimization by 


Kirkpatrick et ah (1983), The maximizer of a function f{6) is of interest. Let 
T = (^ 1 ,^ 2 , be a decreasing set of positive “temperatures”. For every tem¬ 

perature level t E T, iterative steps are carried out, before moving to the next, 
lower, temperature level. In each step, a random suggested move from the current 
6 to another 6' ^ 6 is generated. The move is then accepted with a probability 
that depends on the ratio exp[(/(6*') — /(6 *))/f]. Typically, although not neces¬ 
sarily, a Metroplis-Hastings criterion, [Metropolis et ah (1953) Hastings (1970) 


IS 


used to decide whether to accept the suggested move 9' or to stay at 9. Then, after 
a predetermined number of iterations N^, we move to the next f < t in T, taking 
the final state in temperature t as the initial state for t'. The motivation for using 
this algorithm is that for high “temperatures”, moves that do not improve the 
target function are possible, so the algorithm does not get stuck in a small area of 
the parameter space. However, as we lower the temperature, the decision to move 
to a suggested point is based almost solely on the criterion of improvement in the 
target function value. The name of the algorithm and its motivation come from 
annealing in metallurgy (or glass processing), where a strained piece of metal is 
heated, so that a reorganization of its atoms is possible, and then it colds off so 


the atoms can settle down in low energy position. See Brooks & Morgan (1995) 
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for a general review of simulated annealing in the context of statistical problems. 

In our case, the parameter of interest is or more precisely, the model S. The 
objective function, that we wish to maximize, is 

/(S) = -t||y-A's;3“||i. 


We now describe the proposed algorithm in more detail. We use simulated anneal¬ 
ing with Metropolis-Hastings acceptance criterion as a search mechanism for good 
models. That is, we are not looking for the settling point of the algorithm, but 
we follow its path, hope that much of it will be in neighborhood of good models, 
and hnd the best models along the path. 

We say the algorithm is in step {t, i) if the current temperature is t G T and 
the current iteration in this temperature is i G {1,...,W}- For simplicity, we 
describe here the algorithm for Nt = N for all t. Let SI and j3l be the model 
and the corresponding least square estimator in the beginning of the state 
respectively. An iteration includes a suggested model a least square estimator 
for this model, (31^, and a decision whether to move to and (31'^ or to stay at SI 
and /3j. We now need to dehne how S^^ is suggested and what is the probability 
of accepting this move. 

For each S'^, we suggest by a minor change, i.e., we take one variable out 
and we add another in, and then obtain (31'^ by standard linear regression. Assume 
that for every variable j G (1, ...,p) we have a score 7 ^, such that higher value of 
reflects that the variable j should be included in a model, comparing with other 
possible variables. WLOG, assume 0 < 7 ^ < 1 for all j. We choose a variable 
r* G SI and take it out with the probability function 


out 

yi.r 


7^ 


-1 


E 

u&Sl 




-1 ■ 


Vr G 


SI. 


( 8 ) 


Next, we choose a variable ^ SI and add it to the model with the probability 
function 


Pil 


It 


E 

uiSl 


lu 


'it i s\. 


(9) 


Thus, 




and we may calculate the LS solution /5E for the model AE- The hrst part of our 
iteration is over. A potential candidate was chosen. The second part is the decision 



whether to move to the new point or to stay at the current point. Following the 
scheme of simulated annealing algorithm with Metropolis-Hastings criterion we 
calculate 


q = exp 



y - XsiMl - 



p(si* ^ 5;) 

p(St ^ si*) 


where 


p(Sl ^ Si*) = Piii‘. p“ 
p(Si* ^ Si) = Piii,. pj 


We are now ready to the next iteration i + 1 by setting 




(S;+,/3;+) w.p min(l,(/) 

(Si,l3i) w-P max(0, l-(/). 


( 10 ) 


Along the run of the algorithm, the suggested models and their corresponding 
MSEs are kept. These models are used to form A4{k), and Q can be then identihed 
for a given value of rj. 

We point out now several issues that should be considered when using the 
algorithm. First, the algorithm was described above for one single value of k. In 
practice, one may run the algorithm separately for different values of k. Another 
consideration is the tuning parameters of the algorithm that are provided by the 
user: The temperatures T; the number of iterations N] the starting point S'/p and 
the vector 7 = ( 71 , ..., 7 p). Our empirical experience is that the hrst three can be 
managed without too many concerns; see Section Regarding the vector 7 , a 
wise choice of this vector should improve the chance of the algorithm to move in 


desired directions. We deal with this question in Section |3.2[ However, in what 
follows we show that, under suitable conditions, the algorithm can work well even 
with a general choice of 7 . 

Dehne Sq, sq and as before and let /i = X(3^. That is, F = /i + e. We hrst 
introduce a few simple and common assumptions: 


(Al) M\l = 0{n) 

(A2) So is small, i.e., sq = 0(1)- 
(A3) p = n“, a > 1 
(A4) e~W(0,a2j) 
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Denote for the set of positive entries in 7 . That is, C ...p} is a 

(potentially) smaller group of predictors than all the p variables. Denote also 
= \A^\ for the size of A^ and 7 mm := min 7 j for the lowest positive entry in 7 . 

i£A-y 

Informally, the algorithm is expected to preform reasonably well if; 

1. The true model is relatively small (e.g., with 10 active variables). 

2. A variable in the true model is adding to the prediction of a set of variables 
if a very few (e.g., 2 ) other variables are in the set. 

Our next assumption is more restrictive. Let S be an interesting model with 
size So—a model with not too many predictors and with a low MSE. The models 
we are looking for are of this nature. We facilitate the idea of S being an inter¬ 
esting model by assuming that is close to p (in the asymptotic sense). We 

virtually assume that for every model with s = s = |^|, which is not S, if we 
take out a predictor that is not part of S, and replace it with a predictor from S, 
the subspace spanned by the new model is not much further from fi, comparing 
with the subspace spanned by the original model. Formally, denote Vs for the 
projection matrix onto the subspace spanned by the columns of the submatrix 

Xs- 

(Bl) There exist to > 0 and a constant c > 0, such that for all S, 151 = Sq — 1, 
for all j G 5 n j' G 5'^ fl S^, and for a large enough n 


1 

n 


\\Vs-M 


2 

2 


WVs^vW 


> 4to log c. 


( 11 ) 


where 5* = 5 U {r}. 

We note that since c could be lower than one the right hand side of ( [IT| ) can 
be negative. The following theorem gives conditions under which the simulated 
annealing algorithm is passing through an interesting model S. More accurately, 
the theorem states that there is always strictly positive probability to pass through 
S in the next few moves. This result should apply for all models that Assumption 
(Bl) holds for. Note however, that we do not claim that the algorithm finds 
all the models in a minimal class. Proving such a result would probably require 
complicated assumptions on models with larger size than sq, and their relation to 
S and other interesting models. 

Let P^{S'\S) be the probability of passing through model S' in the next m 
iterations of the algorithm, given the current temperature is f, and the current 
state of the algorithm is the model S. 
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Theorem 3.1 Consider the simulated annealing algorithm with n = Sq and with a 
7 vector such that 'ymin > C- Let Assumptions (Al)-(A4) hold and let Assumption 
(B1) hold for some temperature to and with c = c^. If S C A^ then for all S' C 
with s = So, for all m > Sq — |*S fl S| and for large enough n, 


Kis\s) > 


_'5o(^7 '®o) 


so 


( 12 ) 


A proof is given in the appendix. Theorem 3.1 states that for any choice of 


the vector 7 snch that the entries in 7 are positive for all the predictors in S, 
the probability that the algorithm wonld visit a S^ in the next m moves is always 
positive, provided the temperatnre is high enongh, and provided it is possible to 
move from the current model to S' in m moves. Recall that our intention here is 
to use the algorithm as a search algorithm for several models. 

For the classical model selection setting with p < n, a similar method was 


suggested by Brooks et ah (2003) Their motivation is as follows. When searching 
for the most appropriate model, likelihood based criteria are often used. However, 
maximizing the likelihood to get parameters estimates for each model becomes 
infeasible as the number of possible models increases. They therefore suggest to 
simplify the process by maximizing simultaneously over the parameter space and 
the model space. They suggest a simulated annealing type algorithm to implement 


this optimization. The algorithm Brooks et ah (2003) suggested is essentially an 
automatic model selection procedure. 


3.2 Choosing 7 

The simulated annealing algorithm described above is provided with the vector 7 . 
The values 71 , ..., 7 p should represent the knowledge regarding the importance of 
the predictors, although we do not assume that any prior knowledge is available. 
As it can be seen in equations (§-( 0 , predictors with high 7 values have larger 
probability to enter the model if they are not part of the current model, and lower 
probability to be suggested for replacement if they are already part of it. Since p 
is large, we may also beneht if 7 includes many zeros. 

One simple choice of 7 is to take the absolute values of the univariate correla¬ 
tions of the different predictors with Y. We could also threshold the correlations in 
order to keep only predictors having large enough correlation (in absolute value) 
with Y. However, using univariate correlations is clearly problematic since it 
overlooks the covariance structure of the predictors in X. 
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Another possibility is to first use the Lasso with a relatively low penalty, and 
then to set ■jj = |/d^|/||/d'^||i. The idea behind this suggestion is that predictors 
with large coefficient value may be more important for prediction of Y. 

However, as discussed in Section]^ the Lasso might miss some potentially good 
predictors. It is well known that the Elastic Net may add these predictors to the 
solution, although it might also add unnecessary predictors. Moreover, it is not 
clear how to choose jj using solely the Elastic Net. The Lasso and the Elastic 
Net estimators are not model selection consistent in many situations. However, 
for our purpose, combining both methods together may help us get a reservoir of 
promising predictors. 


Zou & Hastie (2005) provided motivation and results that justify the common 


knowledge that the Elastic Net is better to use with correlated predictors. Since 
we intend to exploit this property of the Elastic Net, this paper offers an additional 
theoretical background. We present a more general result later on this section, 
but for now, the following proposition demonstrates why the Elastic Net tends to 
include correlated predictors in its model. 

Proposition 3.2 Define X and Y as before, and define by (|^. Denote 
p = Assume |/3f^| > cy for some cy > 0. If |p| > 1 — A 2 c|/||H ||2 


then 1/32 


EN\ 


> 0 . 


3.2 


gives motivation for why (3 


EN 


A proof is given in the appendix. Proposition 
has typically a larger model than j3^. It also quantifies how much correlated two 
predictors need to be so the Elastic Net would either include both predictors or 
none of them. 

Going back to our 7 vector, the next question is how to use the Lasso and 
the Elastic Net in order to assign “scores” to each predictor. Let Sl and be 
the models that correspond to (3^ and (3^^, respectively. Define for the group 
of predictors that were part of the Elastic Net model but not part of the Lasso 
model and Sout for the predictors that were not included in any of them. Note 
that Sl'A S+ = SlA Sout = 5*+ n Sout = 0 and SlA S+V3 Sout is {1, ...,p}. Define 


/37i) = argmin(i||r - A'/?||l + A V i e (0.1), 

S ''IT' . , ' 

j=l 

and let Si/{6) be the appropriate model. In this procedure, a reduced penalty 
is given for predictors that /3^ might have missed. Thus, these predictors are 
encouraged to enter the model, and since they may take the place of others. 
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predictors in Sl that their explanation power is not high enough are pushed out 
of the model. Note that /d+((5) is a special case of as dehned in ([^, with 

wj = 

We demonstrate how the reduced penalty procedure works using a toy example. 
A data set with n = 30 and p = 50 is simulated. The true value of f3 is taken to 
be (3^ = (0.5 0.5 1 1 1 0 0 ... 0)^ and is taken to be one. The predictors are 
independent normal variables with the exception of 0.8 correlation between 
and Predictor 1 is included in the Lasso model, however predictor 2 is not. 

Figure [^presents the coefficients’ estimates of Xd),X^^) and X*^^^ when lowering 
the penalty of X^^b Note how X*^^^ enters the model for low enough penalty while 
X*^^^ leaves the model for low enough penalty (on X( 2 )). 



Predictor 



Figure 1: Toy example: coefficients’estimates for predictors X^^b and N^When lowering 
the Lasso penalty for only. The rightmost point corresponds to a Lasso procedure with 
equal penalties for all predictors 


We suggest to measure the importance of a predictor j G S'+ by the highest 5 
such that i G S^{5). On the other hand, the importance of a predictor f G Sl, 
can be measured by the highest 5 such that f ^ (now, smaller 6 reflects j' 

is more important). With this in our mind, we continue to the derivation of 7 . 

Let A = ((5o < < ... < Sh) be some grid of [ 0 , 1 ], with 5o = 0 and 6h = 1. 

For each 5 G A, we obtain Dehne 


argmaxji : 7 ^ 0 } j ^ Sl 

i 

argmax{z : /3+(5j)j =0} j G Sl 

i 


and if the argmax is over an empty set, dehne it = Q. Let 6^ := 5i*. Now, we 
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suggest to choose 7 j as follows: 


0 j G Sout 

.1-1 J e Si. 

for all j G {1, This choice of 7 has the following nice properties. 


• A predictor j ^ Sl with i* = 0 is excluded from consideration. 

• On the other hand, for a predictor j & Sl, ii i* = 0 than 7 ^ = 1, which is 
the maximal possible value. Even when the penalty for other predictors was 
dramatically reduced, leading to their entrance to the model, j remains part 
of the solution and hence it is essential for prediction of Y. 

• Since predictors in Sl were picked when equal penalty was assigned to all 
predictors, they get priority over the predictors in 5+. 

• However, for two identical predictors (or highly correlated predictors) = 

such that j E Sl and j' ^ Sl, we get a desirable result. By Proposition 

and j ^ S^{6h-i)- Therefore i* = i*, = h — 1, and hence if 6 h-i is taken to 
be close to one, then 7 ^ ~ 7 ^/ ~ 0.5 as one might want. 


3.2 


we know that X) G S+. Now, for 6 h-i < 1 it is clear that j' G S!^{6h-i) 


Proposition |3.2| deals with the case of two correlated predictors. In practice, 
the covariance structure may be much more complicated. Therefore the question 
arises: can we say something more general on the Elastic Net in the presence of 
competing models? Apparently we can. Let Mi and M 2 be two models, that 
is, two sets of predictors, that possibly intersect. Assume that the Elastic Net 
solution chose all the predictors in Mi. What can we say about the predictors in 
M 2 ? Are there conditions on Xm 2 , ^Mi and Y such that all the predictors in M 2 
are also chosen? If the answer is yes (and it is, as Theorem 3.3 states), it justihes 
our use of the Elastic Net to reveal more relevant predictors. In our case, the 
relevant predictors are the building blocks of models in Q. 

In order to reveal this property of the Elastic Net, we analyze , the solution 
of (|^, when assuming all the predictors in Mi have non-zero values. We denote 
M^~'> for (Ml U M2Y, the set of predictors that are not included in Mi or M2 and 
X = Xj^(-) for the appropriate submatrix of X. We let and be the 

coordinates of that correspond to Mi, M 2 and (MiUM 2 )'^, respectively. Then, 
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we show that we can concentrate on F = Y — which is the nnexplained 

residnal of V, after taking into acconnt X. Finally, we show that both Mi and M 2 
are chosen by the Elastic Net if the prediction of Y using Mi, namely , 

projected onto the subspace spanned by the columns of M 2 is correlated enough 
with Y. Formally, 

Theorem 3.3 Define as before. Let Mi and M 2 be two models with the ap¬ 
propriate submatrices Xm^ and Xm 2 ■ Define X and Y as before. Define and 
fi™ as before. Denote Vm 2 far the projection matrix onto the subspace spanned 
by the columns of Xm 2 - WLOG, assume IM 2 I < |Mi| and that all the coordinates 
of are different than zero. Finally, if 

Y^Vm2XmM'! > cfiX,,X2 ,XM„YJZ^), (13) 


then all the coordinates of are different than zero. 


A proof and a discussion on the technical aspects of condition (13) and the constant 


Cl are given in the appendix. Theorem |3.3| states that under a suitable condition, 
predictors belong to at least one of two competing models are chosen by the Elastic 
Net. In our context, when we have a model Mi with a good prediction accuracy, 
i.e., Xmi/3m^ is close to Y, then predictors in any another model M 2 which has 
similar prediction, that is VM 2 XMifiM^ is also close to Y, would be chosen by the 
Elastic Net. Hence, these predictors are expected to have a positive value in 7 , 
and our simulated annealing algorithm would pass through these models, provided 


the conditions in Theorem 3.1 are met. Therefore, these models are expected to 
appear in Q. 


4 Numerical Results 

4.1 Simulation Study 

We consider a setup in which there are few models one would want to reveal. 
The following model is used Y = Xfi + e, e ~ iV(0,/) with fij equals to C for 
i = 1 , 2 ,..., 6 and zero for j > 6 . C is a constant chosen to get a desired signal to 
noise ratio (SNR). The predictors in X are all i.i.d. A^(0,/) with the exception of 
and which dehned by 

= ^[X^L + x(2)] + ^1, 6 ~ at, (0, 
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~ Nn (^ 0 , 


where and ^2 are independent. In this scenario, there are 4 models we would 
like to hnd: (I) {1,2,3,4,5, 6 }; (II) (5, 6 ,7,8}; (III) (3,4,5, 6 ,7}; and (IV) (1,2,5, 6 , 8 }. 
For each simulated dataset, we do the following: 


1. Obtain 7 as explained in Section 3.2 The tuning parameter of the Lasso is 
taken to be the minimizer of the cross-validation MSE. For the Elastic Net, 
a in ([^ is taken to be 0.4. 


2. Run the simulated annealing algorithm for k = 4, 5, 6. The tuning param¬ 
eters of the algorithm are chosen quite arbitrarily: T = 10 x (0.7^, 0.7^,..., 
O. 720 ); A = (0,0.02, 0.04,..., 0.98, 1); Nt = N = 100 for all t G T. 


3. Then, for each model (I)-(IV), we check whether the model is the best 
model obtained (as measured by MSE) among models with the same size. 
For example, we check if Model (II) is the best model out of all models that 
were found with k = A. We also check whether the model is one of the top 
hve models among models with the same size. 


A 1000 simulated datasets were generated for each different scenario: For n = 100, 
p = 200, 500,1000 and for SNR = 1, 2,4, 8,12,16. Table displays the proportion 
of times each model was chosen, either as the best one, or as one of the top hve 
models. The results are as one might expect. For large SNR, the models are chosen 
more frequently. However, models (III) and (IV) are competing, in the sense that 
they both include hve predictors. Even for large SNR, each of the models, (III) 


and (IV), are chosen in about 50% of the cases. As recommended in Section 3.1 


we should start the algorithm from diherent initial points, that is, diherent initial 
models. Figure presents comparison between running the algorithm once and 
three times, from diherent points. Note the improved results for models (III) and 
(IV) when we start the algorithm from three diherent starting points. The results 
described in this section are quite similar to results obtained when forming Q{k, rj) 
as dehned in ([^, for each k = 4, 5, 6 separately and using an arbitrary small value 
of rj. 


4.2 Real data sets 

We demonstrate the utility of using a minimal class of models in the analysis of 
two real datasets. The tuning parameters of the Lasso and the Elastic Net were 
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Table 1: Proportion that each model is chosen as best model or as one of top five models for 
different number of potential predictors {p) and various SNR values. 




p = 

200 

P = 

500 

P = 

1000 

SNR 

Model 

Best 

Top 5 

Best 

Top 5 

Best 

Top 5 


(I) 

0.00 

0.01 

0.00 

0.00 

0.00 

0.00 

1 

(11) 

0.42 

0.62 

0.28 

0.46 

0.23 

0.38 

1 

(III) 

0.04 

0.08 

0.01 

0.02 

0.00 

0.00 


(IV) 

0.04 

0.08 

0.02 

0.03 

0.00 

0.01 


(I) 

0.10 

0.12 

0.05 

0.06 

0.04 

0.05 

O 

(11) 

0.94 

0.96 

0.92 

0.94 

0.94 

0.95 

Z 

(III) 

0.27 

0.34 

0.18 

0.24 

0.15 

0.17 


(IV) 

0.28 

0.37 

0.18 

0.22 

0.14 

0.17 


(I) 

0.38 

0.38 

0.20 

0.20 

0.11 

0.11 

4 

(11) 

0.96 

0.96 

0.96 

0.96 

0.96 

0.95 


(III) 

0.38 

0.46 

0.31 

0.36 

0.22 

0.24 


(IV) 

0.39 

0.46 

0.28 

0.31 

0.24 

0.26 


(I) 

0.72 

0.72 

0.46 

0.46 

0.32 

0.32 

O 

(11) 

0.97 

0.97 

0.97 

0.97 

0.96 

0.96 

o 

(III) 

0.41 

0.48 

0.36 

0.40 

0.30 

0.31 


(IV) 

0.44 

0.50 

0.34 

0.37 

0.29 

0.31 


(I) 

0.86 

0.86 

0.66 

0.66 

0.49 

0.49 

12 

(11) 

0.98 

0.98 

0.97 

0.97 

0.96 

0.96 


(III) 

0.49 

0.55 

0.41 

0.44 

0.32 

0.34 


(IV) 

0.42 

0.48 

0.37 

0.40 

0.32 

0.34 


taken to be the same as in Section 4.1 


The tuning parameters of the simulated 


annealing algorithm were T = 10 x (0.7^, 0.7^,..., 0.7^°), A = (0,0.01,0.02,..., 


0.98, 0.99,1), and Nt = N = 100 for all t E T. 


4.2.1 Riboflavin 

We use a high-dimensional data about the production of riboflavin (vitamin B2) in 


Bacillus subtilis that were recently published, Biihlmann et ah (2014) The data 
consist p = 4088 predictors. These are measures of log expression levels of genes 
in n = 71 observations. The target variable is the (log) riboflavin production rate. 

Sl included 40 predictors (and intercept), and Sen included 59 predictors when 
taking the tuning parameters as described in Section 4.1 In total, we considered 
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starting points 

— One 

- - Three 


Modei 

^( 1 ) 

(II) 

(III) 

(IV) 


Figure 2: Proportion that each model is chosen as one of top five models for different number 
of potential predictors (p) and various SNR values. There is an apparent improvement when 
running the algorithm from three starting points. 


61 different predictors (i.e., genes). Panel (a) of Fignre presents the histogram 
of the positive values in 7 . 

We run the algorithm from three random starting points for each model size 
between 1 and 10. We kept the hve best models for each size and starting point. 
We then combined these models to get, after removal of duplicates, a total of 112 
models. See Table for the number of unique models as a function of the model 
size. The following insights are drawn from examining more carefully the models 
we obtained (see Tablein the appendix): 

• In total, the models include 53 different predictors. Out of these, 35 predic¬ 
tors appear in less than 10 % of the models, meaning they are probably less 
important as predictors of riboflavin production rate. 

• Gene number 2564 appears in all models of size larger than 3 and in 5 out of 
8 models of size 3. However, this gene is not included in any of the smaller 
models. This gene is the only one that appears in more than half of our 
models. We can infer that while this gene does not hold an effect strong 
enough comparing to other genes in order to stand out, it has a unique 
relation with the outcome predictor that could not be mimicked using other 
combination of genes. 
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0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 

Y Y 

(a) Riboflavin data (b) Air pollution data 

Figure 3: Histograms of the values of 7 for positive entries only in the two dataset analysis 
examples. 

Table 2: Riboflavin data: Number of unique models for each model size after running the 
algorithm from 3 different starting points 


Model size 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

Number of models 

5 

5 

8 

6 

13 

15 

15 

15 

15 

15 


At least one gene from the group {4002,4003,4004,4006} is contained in all 
models of size larger than one, although never more than one of these genes. 
Genes number 4003 and 4004 appear more frequently than genes number 
4002 and 4006. Looking at the correlation matrix of these genes only, we see 
they are all highly correlated (pairwise correlations > 0.97). Future research 


could take this finding into account by using, e.g., the Group Lasso, Yuan 


fc Lin (2006) 


• Similarly, either gene number 1278 or gene number 1279 appear in about 
half of the models. They are also strongly correlated (0.984). The same 
statement holds for genes number 69 and 73 (correlation of 0.945) as well. 

• The impotence of genes number 792,1131, and possibly others, should be 
also examined since each of them appears in a variety of different models. 

We now compare our results to models obtained using other methods, as reported 
in Biihlmann et ah (2014) The multiple sample splitting method to get p-values. 


Meinshausen et ah (2009), yields only one signihcant predictor. Indeed, a model 
that includes only this predictor is part of our models. If one constructs his model 
using the stability selection, Meinshausen & Biihlmann (2010), as a screening 


process for the predictors, he would get a model consisting three genes, which 
correspond to columns number 625, 2565 and 4004 in our X matrix. However, 
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this model is not included in our top models. In fact, the highest MSE for a 
model in our 8 models of size 3 is 0.2047 while the MSE of the model suggested 
using the stability selection is 0.2703, more than 30% difference! 


4.2.2 Air pollution 


We now demonstrate how the proposed procedure can be used for traditional, 
purportedly simpler, problem. The air pollution data set, McDonald fc Schwing| 


(1973), includes 58 Standard Metropolitan Statistical Areas (SMSAs) of the US 


(after removal of outliers). The outcome variable is age-adjusted mortality rate. 
There are 15 potential predictors including air pollution, environmental, demo¬ 
graphic and socioeconomic predictors. Description of the predictors is given in 
Table in the appendix. 

There is no guarantee that the relationship between the predictors and the 
outcome variable has linear form. We therefore include commonly used transfor¬ 
mations of each variable, namely natural logarithm, square root and power of two 
transformations. Considering also all possible two way interactions, we have a 
total of 165 predictors. 

High-dimensional regression model that includes transformations and inter¬ 
actions has been dealt with in the literature. For example, by using two steps 


procedures, e.g., Bickel et al. (2010), or by solving a relevant optimization prob¬ 
lem, e.g., Bien et al. (2013) Our procedure has a different goal, since we are not 
looking for the best predictive model, but rather for a meaningful insights about 
the data. 

Following the Lasso and Elastic Net step, we are left with 44 predictors with 
positive (one untransformed predictor, 3 log transformations, 4 square root 
transformations, 8 power of two transformations and the rest are interactions). 
Panel (b) of Figure [^presents the histogram of the positive values in 7 . 

For each k = 1,2,..., 10, we run the algorithm from three starting points, 
and then keep the 5 best models. In total, we get 126 unique models. Table 
summarizes the results for prominent predictors, that is, predictors that appear 
in at least quarter of the models we obtained. The table presents a matrix of the 
joint frequency of each two predictors. Each cell in the table is the number of 
models including both the predictor listed in the row and the predictor listed in 
the column. The diagonal is simply the number of models that a predictor appears 


m. 


Three (transformed) main effects are chosen. The nitric oxide pollution is in- 
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Table 3: Frequency that each two predictors together in the 126 models. The diagonal is simply 
the number of models that a predictor appears in. For example, in 27 models both log(NOx) 
and Vnwht appear 



(1) 

(2) 

(3) 

(4) 

(5) 

(6) 

(7) 

(8) 

(1) log(NOx) 

97 

27 

36 

30 

50 

31 

33 

35 

(2) -s/nwht 


33 

7 

8 

14 

10 

0 

0 

(3) VHC 



37 

12 

18 

10 

16 

15 

(4) HC X prec 




33 

26 

17 

18 

8 

(5) jant X ovr65 





66 

30 

27 

26 

(6) pphs X educ 






37 

14 

14 

(7) nwht X ofwk 







46 

1 

(8) nwht X mst 








43 


valuable for prediction of mortality rate. This predictor (in a log shape) appears 
in a large majority of the models. Apart from this predictor, the hydrocarbon 
pollution appears (after a square root transformation), but only in about 30% 
of the models. There is, however, one result that catches the eye. The two ze¬ 
ros in the matrix (second row, last two values) mean that interactions involving 
the percentage of non-white population are only part of models that do not in¬ 
clude the percentage of non-white population as a main effect. Moreover, the two 
interactions do not make much sense. The evident conclusion is that the two in¬ 
teractions took the place of the main effect. We therefore repeat the analysis after 
the removal of these two interactions. 

The new frequency matrix is displayed in Table The conclusion regarding 
the importance of the nitric oxide pollution remains. Nevertheless, hydrocarbon 
pollution is not relevant anymore. The percentage of non-white population appears 
untransformed but also after taking its squared root. However, this predictor 
appears in single form only for each model. We conclude that this predictor should 
be used for prediction of the mortality rate, but the question of transformation 
remains unsolved. 

Turning to the interactions. The interaction between percentage of elderly 
population and the average temperature in January appears while the appropriate 
main effects do not appear. However, the absence of age related effect is not so 
surprising since the outcome variable, the mortality rate, is age corrected. The 
interaction between the household size and the level of education appears in half 
of the models, whereas appropriate main effects do not appear. This interaction 
could be a proxy to other effects that were not measured. Interactions involving 


21 








Table 4: Frequency that each two predictors appear together in the 126 models obtained after 
removal of the two interactions. The diagonal is simply the number of models that a predictor 
is included. For example, in 43 different models both log(NOx) and V nwht appear. 



(1) (2) (3) (4) (5) (6) (7) (8) 

(1) nwht 

(2) log(prec) 

(3) log(NOx) 

(4) Vnwht 

(5) dens x prec 

(6) hnm x prec 

(7) jant X ovr65 

(8) pphs X educ 

37 13 36 0 12 17 28 26 

31 31 13 9 2 15 19 

106 43 28 40 67 62 

44 11 13 24 24 

30 14 26 20 

40 38 28 

68 47 

63 


the average precipitation appear less than other predictors. The interaction with 
hnmidity nsnally appears withont the main effect of precipitation. Nevertheless, 
both interactions shonld be taken into acconnt when constrncting a prediction 
model for the mortality rate. 


5 Discussion 


Model selection consistency is an ambitious goal to achieve when dealing with 
high-dimensional data. A “minimal class of models” was defined to be a set of 
models that should be considered as candidates for prediction of the outcome vari¬ 
able. A search algorithm to identify these models was developed using simulated 


annealing method. Under suitable conditions, that are outlined in Theorem 3T 
the algorithm passes through models of interest. 

A score for each predictor is given using the Lasso, the Elastic Net and a 
reduced penalty Lasso. These scores are used by the search algorithm. They 
are not necessarily optimal but we claim that they are sensible. Other scoring 
methods may achieve better results. On the other hand, the scores we use here 
may be used for other purposes. Theoretical justification for using the Elastic Net 
to unveil predictors the Lasso might have missed was also presented. 

A simulation study was conducted to demonstrate the capability of the search 
algorithm to detect relevant models. As illustrated using real data examples, a 
class of minimal models can be used to derive conclusions regarding the problem 
at hand. This is rarely the case that a researcher believes a one true model exists, 
especially in the p > n regime. Therefore, we suggest to abandon the search for 
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this “holy grail”, and to analyze the class of minimal models instead. 

It is well known that achieving good prediction and snccessfnl model selection 
simnltaneously, in a reasonable compntation time, is impossible, especially in the 
high-dimensional setting. We therefore snggested here to make a compromise. 
Onr approach is not necessarily optimal for prediction, nor for model selection. 
However, it offers a data analysis method that takes into acconnt the nncertainty 
in model selection, bnt ensures reasonable prediction accuracy. This method can 
be used for either prediction, parameter estimation or model selection. 

Appendix 


A Proofs 


A.l 


Proof of Theorem 


3.1 


We start with the following lemma. 


Lemma A.l Assume Y = fi + e and assume also (Al)-(A4). Let Sk = {S' : [S'! = 
k,(ds = the set of all models with k variables, such that (3s, 

the LS estimate, is unique. Denote S* = S U {j}, j ^ S for a model that includes 
S and additional variable j not in S. We have 


max e^{Xs*^s* - Xs^s) = 0p{n) 

O Gofc 

l<t<p 


Proof. Let fj be the vector of coefficients obtained by regressing 

column in X, on Xs and let Vj be the projection operator on the subspace spanned 

by the part of X^L which is orthogonal to the subspace spanned by Xs- That is, 

' ||.YO)-A's?j|li 

Let (3g^ be the coefficient estimate of X^L model S*, and let (3^3 be the coefficient 
estimates of the variables in S but for the model S*. Since {X^L _ Xs^j) is 
orthogonal to the subspace spanned by the columns of Xs we have 

XsJs- = .Y«>/3i. + XsPsi 

Jo 3 3 

= (A'«> - Ys6)/3J. + Xs(^ + 6/3j.) 
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Therefore, 


= - XsO)®. + XsPs 

= Vjy + XsPs- 


e^(X5*% - Xs^s) = + e^V,e. 

Now, since H'Pj/iHi < llhlli = 0(^)) 'w® get that for all j, e^VjH = Op{y/n). Next, 
let Zi,..., Zpk+i be A^( 0 , cr^) random variables and observe that the approximate 
size of the set {Sk} x {1, is We have for any a > 0 


P I max —e Vje > a | < P \ max \Zp > \ l — 1 < a 


S£Sk n 

P<3<P 


( 


I an 






j\ — 


cr^ J 


2{k + 1 ) logp + 0 ( 1 ) 


an 


Now, since p = n'^ and k = o[n/ logn) we get that 


P I max —e^Vje > a | = o(l) 

Ti 


and we are done. 


□ 


We can now move to the proof of Theorem 3.1 For simplicity, the notation of i 


as the iteration number for the current temperature t is suppressed. Note that it 
is enough to only consider models such that S' fl F = 0 and to consider m = sq. 
Denote Qt{S,g,j) for the probability of a move in the direction of S in the next 
iteration, that is, the probability of choosing a variable j G S H S'^ and replace it 
with a variable g E O S. Denote S' = {S/{j}} U {g} for this new model. We 
have 


Qt{S,g,i) 


= P{S -E S') min 


1 , exp 


Y -XsPs\\l-\\Y -Xs'Ps' 


P{S' -E S) 
P(S ^ S') 


(14) 

where P(S —>■ S') is the probability of suggesting S', given current model is S. 
Now, since 7 mm P C'y and since the maximal value in 7 equals to one by dehnition, 
we have for all S Y A^, 


l'(/^7 Sq) S ^ 7n — ^7 


So 


u^S 
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(15) 


1 


^ J- *0 
So < ^ — < —• 


Now, by substituting (15) into ([^-(10) we get 

V7i 


P{s S') = 


la 


> 


7« X)j;eS 7 „ So(/i 7 So) 


P{S' S) _ Ij X/n ^5 7 n X)i)eS 7„ 
P{S ^ ^0 72 In Y.V&S' ^ 


1 > 4 - 


Next, we have 


^\\Y - Xs^^s\\l-hy - Xs'h'Wl 

n n 

= - [(F - X^^sO + (>^ - Xsh) ^ 

n L 

= -Y^ (Xs^Ps'-Xsl3s 
n V 

1 T f T/" o "x/" /O ^ ^ T 


Xs'/3s' — Xsf3s 


= -/i iXs'f3s' — 
n \ 

= (Xs'Ps' — 

n V 


n 


H—e Xs'/3s' — Xsfis 


+ A„(5,5') 


(16) 


(17) 


where the second equality is due to $s and $s' being LS estimators. We get that 
an estimator in linear model achieves better (lower) sample MSE, if the correlation 
of the prediction using this estimator with Y is larger. Now, denote S" = S' U S. 
We have 


A„(S', S') — —e^ {Xs"^s" — XsPs) — {Xs''Ps" — Xs'Ps') 
n L 


and if we apply Lemma A.l twice we get that A„(S', S') = 0p(l). Now, regarding 
the hrst term in (p)^. 


n 


-/i 


/ — Xs^s ) — {Ps'y — Psy) 

/ n 

= LiiP6'Hi2-iip^f‘ii2)+A:,(s,s') 


(18) 


where A'^{S,S') = -/i^ ["P^/e — "P^ej. The content of the proof of Lemma 


A.l 


implies that X'^(S, S') = 0p(l). Now, by ( [TT] ) and (18) and since Assumption (Bl) 
holds for to we get that for large enough n 


- [\\Y - Xsl3s\\i-\\Y - Xs,l3s.u^ 

n 


> 4f log c. 


. 7 . 


(19) 
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Now, by substituting (16) and (19) into (14) we get that for large enough n, 


Qto{S,g,j) > 


5o) 


for all S' 7^ S', j G S' n and G S''^ fl S'. (12) follows from this immediately since 
for any integer m and for all S ^ S, 




{S:Sns=$} 

j&snsi 

ges^ns 


_' 5 o(h'y Sq) 


so 


A.2 Proof of Proposition 3.2 


Recall that the Elastic Net estimator minimizes 


\Y-xp\\l + x,\p\ + x,\m 


( 20 ) 


Now, WLOG assume that 13^^ is a solution such that > 0. For convenience, 
we omit the ^^EN” superscript from now on (i.e., (3 = /3^^). Dehne the subspace 


:= {/3 : Vi 7^ 1, 2 (3i = /3i, /3i = r/3i, /?2 = (1 - r)^i}. 


( 21 ) 


If the minimum of (20) over B is obtained for r 7^ 1, then given that predictor 1 


is part of the Elastic Net model, predictor 2 is also part of this model. 

WLOG, write down X as X = (X(i2) X_(i2)) where X(i2) = (X*^^^ X^^^) are 

the hrst two columns of X and X_(i2) are the rest of its columns. Similarly, we 
have (3'^ = {(3j^2) /^-(i2)) where (3(i2) is the hrst two entries in the vector (3 and 


/d_(i2) is the rest of the vector. Dehne Y = Y — X_(i2)/d-(i2)- We can rewrite (20) 


as 


\Y - X(i2)/?(12)||2 + Ai(|/3_(i2)| + |/?(12)|) + ^2 {\\/3-(12)111 + IIAl2)||2) (22) 


If the minimum of (22), on B, is achieved at 0 < r* < 1 then (32 must be non zero. 


Minimizing (22) on is essentially minimizing 


— 2Y'^ X(i2)(3(12) + ||X(12)/3(12)||2 + A 21 1/5(12)112 


(23) 


on B . Now, by the dehnition of in (21) and using simple algebra we get that 
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(23) equals to 


(r(.Y<=l - A'**') - A<"l) - -T)(l-p) + I - - r(l - t) 


This is a quadratic function of r, and by equating its derivative to zero we get 
that 

1 Y^{X(^)-XW) 


2 2A(A2 + 1-p) 


is the minimizer of (20) (the coefficient of the quadratic term is positive). Note 
that for we get the expected t* = ^ solution. Note also that this 

reveals no information regarding the Lasso where A 2 = 0. Next, we get that 
0 < r* < 1 if 

yT(x(2) _ x(i)) 


/3i(A2 + 1 — p) 
Since ||X(2) - X(i)||2 = 2(1 - p) we have 


< 1 . 


(24) 


|fT(x(2)_x(i))| <^|y^||xf)-Xf)| < 11X112^2(1-P), 


di)i 


i=l 


using the triangle inequality and then Cauchy-Schwartz inequality. It is assumed 
that /3i > CjS > 0 and it is known that ||X ||2 < Therefore, we may rewrite 


(24) as 


< 1 . 


V2\\Y\\2VT^j 

Cp{\2 + 1 - p) 

Now, Denote t = \/l — p,u = we have 

C/3 


— y/2ut + A 2 > 0. 


For A 2 > we get the result we want for all p’s. For A 2 < ^ we have 


pi - p > - 2 A 2 ), 


yi -P < - 2A2). 


(25) 

( 26 ) 


The RHS of (25) is larger than 1 if A 2 < y/2u — 1. That is, there is no suitable 


p for this case. The RHS of (26) is always positive, and for the same condition 
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A 2 < a/2m — 1, it also meaningful, i.e., {u — — 2 A 2 ) < and in terms of p, 


p> 1- 



- 2A2)' 


or alternatively, 



and by Taylor expansion for 2 X 2 /u we get 


p> 1- 



□ 


A.3 Proof of Theorem 3.3 


The proof is similar to the proof of Proposition 3.2 Let {3 = be the Elastic 


Net estimator and denote (3m for the values in (3 corresponding to the set of 
predictors M. We can partition the set of potential predictors {1, 2, ...,p} to four 
disjoint subsets: Mi fl M|; M( fl M 2 and Mi fl M 2 . We replace (21) with 


B '.—{(3 : (3m(-) — (^M-inM2 — /^MinM2 5 

l3MfnM2 = (1 ~ t)Q'( 3MinM§}- 


(27) 

(28) 


where (3m is dehned as the values in (3 corresponding to the set M and 0' is the 
matrix of coefficients obtained from regressing Xm-^dm^ on XM^nM 2 - AVe dehne 0 
to be an augmented version of 0', which we obtain by regressing Xmi on Xmj- 
That is, 

Xm2 0 = 'Pm2^Mi (29) 


Note that on B, 


X(3 — X(3m(-) + tXmi^Mi + (1 — 't)Xm2^(3mi 


Recalling that Y = Y — X(3ju(-), minimizing (|^ on is equivalent to minimize 


Y — tXmi^Mi — (1 — 'r)XM2 0/3Mill2+-Ai['7‘||/3Mi||i + (1 — ''‘)I|0/3 miI|i] 

+X2[A\Ml + {^-rf\\^Ml] (30) 








as a function of r. Using a first-order condition and substituting (29) we find that 
(30) is minimized for 


r = 


— {Y — — 'Pm2)Xmi^Mi + ^(1 |/3mi 111 — I |0/3mi 111) — ^ 2 ! |0/3mi | 


11(1 — 'Pm2)Xmi$Mi\\2 — -^2||^Mi||2 — •^2||0/3 mi||2 


(31) 


Before we continue, note that if X 2 = Xi then 0 is the identity matrix and 
Vm 2 ^Mi = Xi. Substituting these facts into (31), we get that r* = | as one 


might expect. Same result is obtained for the case M 2 C Mi. 

As it can be seen in ( [2^ , the coordinates of /3 m 2 are all different than zero if 
T* < 1. Now, since Vm 2 {I ~ 'PM 2 ) = 0 we get that r* < 1 if 


-Y^ {I -VM2)XM,f3M, + \ \{I -Vm2)XmM\1 - y||0/9Mj| 

Ai 


> - A2||/?Mi||2 


which is certainly true if 

X'^'Pm2XmiPmi —^l|0^Aril|i > —“^1 |/3 miI| i — A 2 I|/3mi1 12 + (32) 


which is true if the condition in (13) is fulhlled for the appropriate ci. 


□ 


B Supplementary tables for Section 4.2 


Table 5: The 112 models selected for the riboflavin data. Each row is a model, the numbers 
are the column number (the gene) in X. 


Model 


1 

1278 

2 

1279 

3 

4003 

4 

1516 

5 

1312 

6 

1278 4003 

7 

1303 4003 

8 

1279 4003 
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9 

1278 

4006 





10 

1279 

4004 





11 

69 

2564 

4003 




12 

73 

2564 

4003 




13 

144 

2564 

4003 




14 

69 

2564 

4004 




15 

69 

2564 

4006 




16 

792 

1478 

4002 




17 

792 

1478 

4003 




18 

792 

1478 

4004 




19 

73 

1279 

2564 

4004 



20 

73 

1279 

2564 

4003 



21 

144 

1279 

2564 

4004 



22 

73 

1849 

2564 

4004 



23 

73 

1279 

2564 

4006 



24 

144 

1279 

2564 

4003 



25 

73 

1279 

1849 

2564 

4003 


26 

69 

1849 

2564 

3226 

4003 


27 

69 

1425 

1640 

2564 

4003 


28 

69 

1849 

2564 

3226 

4004 


29 

69 

1425 

1640 

2564 

4004 


30 

144 

792 

1849 

2564 

4003 


31 

144 

1849 

2564 

3226 

4003 


32 

73 

974 

1279 

2564 

4003 


33 

144 

1278 

1425 

2564 

4003 


34 

73 

792 

2116 

2564 

4004 


35 

73 

1278 

1849 

2564 

4004 


36 

144 

1278 

1849 

2564 

4003 


37 

73 

1279 

1425 

2564 

4004 


38 

144 

792 

1849 

2027 

2564 

4004 

39 

73 

974 

1278 

1849 

2564 

4003 

40 

69 

415 

1849 

2564 

3226 

4004 

41 

73 

792 

974 

2116 

2564 

4003 

42 

73 

1279 

1640 

1849 

2564 

4004 

43 

69 

315 

792 

1849 

2564 

4004 

44 

73 

792 

1849 

2027 

2564 

4004 

45 

73 

792 

1303 

2116 

2564 

4003 


30 



46 

69 

792 

1849 

2027 

2564 

4003 



47 

69 

792 

1282 

1849 

2564 

4003 



48 

69 

792 

1131 

1849 

2564 

4003 



49 

73 

1131 

1278 

1524 

2564 

4006 



50 

144 

1131 

1303 

1524 

2564 

4006 



51 

73 

792 

1528 

1849 

2564 

4003 



52 

73 

792 

1294 

2116 

2564 

4003 



53 

69 

1131 

1278 

1524 

1762 

2564 

4006 


54 

144 

1279 

1762 

1820 

2027 

2564 

4004 


55 

69 

1279 

1425 

1640 

1820 

2564 

4006 


56 

144 

792 

1312 

1849 

2027 

2564 

4004 


57 

73 

1278 

1762 

1820 

1857 

2564 

4003 


58 

73 

1131 

1279 

1524 

1528 

2564 

4003 


59 

69 

792 

1303 

1849 

2484 

2564 

4003 


60 

69 

792 

1639 

1849 

2027 

2564 

4003 


61 

73 

315 

792 

1278 

1524 

2564 

4004 


62 

69 

315 

1425 

1524 

1640 

2564 

4004 


63 

73 

1131 

1279 

1857 

2116 

2564 

4004 


64 

144 

1131 

1279 

1857 

2116 

2564 

4004 


65 

144 

1101 

1131 

1279 

1762 

2564 

4004 


66 

69 

792 

1131 

1849 

2564 

3514 

4004 


67 

73 

792 

1279 

1478 

2027 

2564 

4002 


68 

73 

792 

1131 

1279 

1312 

2116 

2564 

4004 

69 

73 

315 

792 

1279 

1312 

2116 

2564 

4004 

70 

73 

315 

792 

1279 

1503 

2116 

2564 

4004 

71 

69 

792 

1131 

1279 

2116 

2564 

3288 

4006 

72 

73 

315 

1279 

1762 

1849 

2564 

3288 

4004 

73 

144 

974 

1131 

1279 

1524 

2564 

3514 

4003 

74 

144 

974 

1131 

1279 

1425 

1524 

2564 

4003 

75 

69 

792 

859 

1131 

1279 

2116 

2564 

4004 

76 

73 

792 

1279 

1849 

2484 

2564 

4004 

4006 

77 

73 

974 

1101 

1131 

1279 

2564 

3105 

4004 

78 

73 

792 

1131 

1312 

2116 

2242 

2564 

4004 

79 

73 

792 

974 

1131 

1364 

2116 

2564 

4004 

80 

73 

792 

1131 

1312 

1639 

2116 

2564 

4004 

81 

73 

792 

1131 

1364 

2116 

2242 

2564 

4004 

82 

73 

792 

1131 

1312 

2116 

2564 

3905 

4004 


31 



83 

144 

1131 

1279 

1524 

1528 

2484 

2564 

3465 

4004 


84 

73 

974 

1131 

1279 

1524 

2242 

2564 

3514 

4006 


85 

73 

974 

1131 

1279 

1524 

2242 

2564 

3465 

4006 


86 

73 

244 

792 

1131 

1278 

2116 

2564 

3104 

4006 


87 

73 

792 

1131 

1278 

1297 

2116 

2564 

3104 

4006 


88 

144 

1101 

1279 

1425 

1640 

2116 

2484 

2564 

4004 


89 

69 

859 

1101 

1640 

1762 

2484 

2564 

3226 

4003 


90 

73 

315 

792 

1303 

1849 

2564 

4004 

4006 

4045 


91 

73 

144 

315 

792 

1279 

1849 

2462 

2564 

4004 


92 

73 

315 

792 

1303 

1849 

2564 

4004 

4045 

4075 


93 

73 

827 

1131 

1279 

1639 

2242 

2564 

3465 

4003 


94 

69 

1312 

1425 

1640 

1762 

2116 

2564 

3104 

4004 


95 

69 

1312 

1425 

1528 

1640 

1762 

2116 

2564 

4004 


96 

73 

624 

792 

1131 

1278 

1849 

1855 

2564 

4004 


97 

144 

827 

1131 

1279 

1639 

2242 

2564 

3465 

4003 


98 

73 

792 

1131 

1279 

2027 

2116 

2564 

3104 

3288 

4004 

99 

73 

974 

1131 

1279 

1364 

1524 

2027 

2564 

3104 

4003 

LOO 

73 

859 

974 

1131 

1279 

1364 

1524 

2484 

2564 

4003 

LOl 

73 

624 

792 

1131 

1279 

1849 

2116 

2564 

3226 

4004 

102 

73 

1303 

1524 

1762 

2027 

2484 

2564 

3905 

4004 

4075 

103 

69 

974 

1425 

1524 

1640 

2116 

2484 

2564 

3288 

4004 

104 

69 

974 

1278 

1425 

1524 

1640 

2484 

2564 

3288 

4004 

105 

73 

315 

1279 

1294 

1762 

2027 

2462 

2564 

4003 

4075 

106 

69 

974 

1278 

1425 

1524 

1640 

2484 

2564 

3105 

4004 

107 

69 

974 

1425 

1524 

1640 

1857 

2484 

2564 

3288 

4004 

108 

73 

859 

1131 

1278 

1297 

1524 

2462 

2484 

2564 

4003 

109 

73 

1278 

1425 

1524 

1639 

2484 

2564 

3465 

4003 

4045 

LIO 

73 

792 

1131 

1279 

1478 

2116 

2484 

2564 

3465 

4006 

Lll 

73 

859 

1131 

1278 

1503 

1524 

2462 

2484 

2564 

4003 

L12 

73 

415 

859 

1131 

1278 

1503 

1524 

2484 

2564 

4003 



Predictor 

prec 

jant 

jult 

age65 

pphs 

educ 

fad 

dens 

nwht 

wtcl 

line 

HC 

NOx 

SUL 

hum 


Table 6: Potential predictors for mortality rate in Section 


4.2.2 


Description 

Mean annual precipitation in inches 
Mean January temperature in degrees F 
Mean July temperature in degrees F 
Percentage of population aged 65 or older 
Population per household 

Median school years completed by those over 22 

Percentage of housing units which are sound and with all facilities 

Population per square mile in urbanized areas 

Percentage of non-white population in urbanized areas 

Percentage of employed in white collar occupations 

Percentage of families with income < 3,000 dollars in urbanized 

areas 

Relative pollution potential of hydrocarbon 
Relative pollution potential of nitric oxides 
Relative pollution potential of sulphur dioxide 
Annual average percentage of relative humidity at 1pm 
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